
feat: add support for bulk loading#153

Merged
guycipher merged 1 commit into wildcatdb:master from mehrdad3301:bulk_loading
Feb 13, 2026

Conversation

@mehrdad3301
Contributor

Related to issue

What was done

  • Branches: master (no bulk load) vs bulk_loading (BTree BulkPutSorted in flusher + compactor).
  • Tool: wildcatdb/bench, fillseq only.
  • Configs:
    • A – many small flushes: write_buffer_size=262144, num=100000
    • B – medium flushes: write_buffer_size=2097152, num=100000
    • C – few large flushes: write_buffer_size=8388608, num=200000
  • Runs: master = 2 runs per config; bulk_loading = 1 run per config.
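For context on what a sorted bulk build changes, here is a minimal, hypothetical sketch (not WildcatDB's actual `BulkPutSorted` code; `leafCap` and `bulkLoadLeaves` are made-up names): when the memtable is already sorted, the flusher can pack B-tree leaves by slicing the key stream left to right in one O(n) pass, instead of doing a top-down search-and-insert per key.

```go
package main

import "fmt"

// leafCap is an illustrative fanout, not WildcatDB's real node size.
const leafCap = 4

type leaf struct {
	keys []string
}

// bulkLoadLeaves packs already-sorted keys into leaves sequentially,
// avoiding the per-key tree descent that a plain Put loop would do.
func bulkLoadLeaves(sorted []string) []leaf {
	var leaves []leaf
	for len(sorted) > 0 {
		n := leafCap
		if len(sorted) < n {
			n = len(sorted)
		}
		leaves = append(leaves, leaf{keys: sorted[:n]})
		sorted = sorted[n:]
	}
	return leaves
}

func main() {
	keys := []string{"a", "b", "c", "d", "e", "f", "g", "h", "i"}
	// 9 keys with fanout 4 -> leaves of 4, 4, and 1 keys.
	fmt.Println(len(bulkLoadLeaves(keys))) // prints 3
}
```

The internal nodes would then be built bottom-up over these leaves; the point is only that sorted input removes per-key search cost during flush.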

Results

| Config | Branch | fillseq ops/sec | Duration | SSTables | Note |
|---|---|---|---|---|---|
| A (256KB, 100k) | master | 113036 / 108259 | 884ms / 923ms | 3 / 4 | 2 runs |
| A (256KB, 100k) | bulk_loading | 110262 | 906ms | 22 | different flush count |
| B (2MB, 100k) | master | 102249 / 86633 | 978ms / 1154ms | 6 / 1 | 2 runs, high variance |
| B (2MB, 100k) | bulk_loading | 114565 | 872ms | 7 | |
| C (8MB, 200k) | master | 118430 / 73202 | 1.69s / 2.73s | 3 / 2 | 2 runs, high variance |
| C (8MB, 200k) | bulk_loading | 66374 | 3.01s | 8 | |

Summary

  • End-to-end fillseq does not show a clear, consistent win for bulk_loading.
  • SSTable counts differ substantially between branches (e.g. config A: master 3–4 vs bulk_loading 22), so the two branches are not exhibiting the same flush/compaction behavior. That makes it hard to attribute throughput differences to bulk load alone.
  • Config B: bulk_loading was faster in the single run; config C: bulk_loading was slower with more SSTables. High variance on master (B and C) suggests more runs would help.

Ask for help

  1. Why might bulk_loading produce more SSTables (e.g. 22 vs 3–4 on config A)? Is there a known difference in when flushes are triggered or how compactions run on the bulk_loading branch that could explain this?
  2. Best way to measure bulk load benefit: Would a flush-only micro-benchmark (e.g. fixed N keys, time only the B-tree build during flush on master vs bulk_loading, same N) be the right next step to isolate the effect of bulk insertion without conflating it with flush count / compaction behavior?
  3. Any guidance on making bulk loading show a measurable benefit in real workloads (e.g. recommended write_buffer_size or workload shape), or on code paths to double-check (e.g. slice building, flush trigger conditions) would be very helpful.

Thanks in advance.

@guycipher
Member

Bulk loading with keys that are actually sorted will show up in benchmarks. Bulk loading shouldn't cause more SSTables; it should produce the same number of SSTables, since bulk loading just creates a new SSTable.

@guycipher guycipher merged commit 096ddc9 into wildcatdb:master Feb 13, 2026
4 checks passed